Applying Context-Based Prediction in Adversarial Watkins’ Q(λ)-Learning
Abstract
This paper exhibits the transformation of Watkins’ Q(λ) learning algorithm into an adversarial Q-learning algorithm. A method called context-based prediction, borrowed from multimedia data coding, is used for opponent modeling and is incorporated into the transformed algorithm, CBQ(λ). We tested CBQ(λ) against three opponents. The first had no prior knowledge and discovered policies as it played. The second carried prior knowledge and used a fixed policy. The third carried prior knowledge and continued to improve its prior policy as play progressed. CBQ(λ) performed well against all three opponents in tic-tac-toe when given 10 seconds of simulation time.
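CBQ(λ) builds on the tabular Watkins’ Q(λ) update, in which eligibility traces propagate the one-step TD error back to recently visited state-action pairs and are cut whenever an exploratory (non-greedy) action is taken. As a hedged sketch of that base algorithm only (the environment interface, hyperparameters, and greedy tie-breaking here are our own illustrative assumptions, not details from the paper):

```python
import random
from collections import defaultdict

def watkins_q_lambda(env_step, actions, start_state=0, episodes=200,
                     alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1, seed=0):
    """Tabular Watkins' Q(lambda) with accumulating eligibility traces.

    Traces are reset to zero after any exploratory (non-greedy) action,
    which is the defining feature of Watkins' variant.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)              # Q[(state, action)] -> value
    for _ in range(episodes):
        e = defaultdict(float)          # eligibility traces, per episode
        s, done = start_state, False
        while not done:
            # epsilon-greedy action selection (ties broken by action order)
            greedy_a = max(actions, key=lambda x: Q[(s, x)])
            a = rng.choice(actions) if rng.random() < epsilon else greedy_a
            s2, r, done = env_step(s, a)
            best_next = max(Q[(s2, x)] for x in actions)
            delta = r + (0.0 if done else gamma * best_next) - Q[(s, a)]
            e[(s, a)] += 1.0            # accumulating trace for visited pair
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                # Watkins' cut: traces die after a non-greedy action
                e[key] = 0.0 if a != greedy_a else gamma * lam * e[key]
            s = s2
    return Q
```

On a small chain environment, the learned Q-values come to favor actions leading toward the rewarding terminal state; the adversarial CBQ(λ) variant described in the abstract would additionally condition its play on a context-based prediction of the opponent’s moves.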
Similar resources
Temporal Second Difference Traces
Q-learning is a reliable but inefficient off-policy temporal-difference method, backing up reward only one step at a time. Replacing traces, using a recency heuristic, are more efficient but less reliable. In this work, we introduce model-free, off-policy temporal difference methods that make better use of experience than Watkins’ Q(λ). We introduce both Optimistic Q(λ) and the temporal second ...
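The snippet above contrasts one-step backups with replacing traces. As an illustrative sketch (the function name and values are ours, not from the paper), the replacing-trace rule differs from the accumulating rule only in how the trace of a revisited state-action pair is set:

```python
def update_trace(e, key, kind="accumulating"):
    """Update the eligibility trace for the state-action pair just visited."""
    if kind == "accumulating":
        e[key] = e.get(key, 0.0) + 1.0   # traces can grow past 1 on revisits
    elif kind == "replacing":
        e[key] = 1.0                     # recency heuristic: reset to 1
    return e

# Revisiting the same pair twice under each rule:
acc = update_trace(update_trace({}, ("s", "a")), ("s", "a"))
rep = update_trace(update_trace({}, ("s", "a"), "replacing"),
                   ("s", "a"), "replacing")
print(acc[("s", "a")], rep[("s", "a")])  # → 2.0 1.0
```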
Opposition-Based Q(λ) with Non-Markovian Update
The OQ(λ) algorithm benefits from an extension of eligibility traces introduced as the opposition trace. This new technique combines the idea of opposition with eligibility traces to deal with large state-space problems in reinforcement learning applications. In our previous work, the comparison of the results of OQ(λ) and conventional Watkins’ Q(λ) reflected a remarkable increase in perf...
Safe and Efficient Off-Policy Reinforcement Learning
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of “off-policyness”; and (3) it is efficient as it makes the b...
Multi-Legged Robot Control Using GA-Based Q-Learning Method With Neighboring Crossover
Recently, reinforcement learning has received much attention as a learning method (Sutton, 1988; Watkins & Dayan, 1992). It does not need a priori knowledge and has a high capability for reactive and adaptive behaviors. However, there are some significant problems in applying it to real problems, among them the high cost of learning and the large size of the action-state space. The Q-learning (Watkins &...
Disguise Adversarial Networks for Click-through Rate Prediction
We introduced an adversarial learning framework for improving CTR prediction in Ads recommendation. Our approach was motivated by the extremely low click-through rate and imbalanced label distribution observed in historical Ads impressions. We hence proposed Disguise-Adversarial Networks (DAN) to improve the accuracy of supervised learning with limited positive-class information. In the c...
Publication date: 2011